Resources and Introduction
Please install the following packages if they are not already installed.
Access the lecture slides at bit.ly/aar-ug
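The slide does not list the packages here, so the names below are an assumption: the course is taught with the tidyverse and follows ISLR, making these a likely minimal set. A small sketch for checking what is missing:

```r
# Hypothetical package list -- not stated on this slide; the course uses
# the tidyverse (L7) and follows ISLR (reading list), so these are a guess.
pkgs <- c("tidyverse", "ISLR")
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
missing_pkgs                      # packages still to be installed, if any
# install.packages(missing_pkgs)  # run this line to install them
```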
I am Ayush.
I am a researcher working at the intersection of data, law, development and economics.
I teach Data Science using R at Gokhale Institute of Politics and Economics
I am an RStudio (Posit) certified tidyverse instructor.
I am a Researcher at the Oxford Poverty and Human Development Initiative (OPHI), University of Oxford.
Reach me
ayush.ap58@gmail.com
ayush.patel@gipe.ac.in
What are Statistical Learning Techniques?
When to apply a given technique?
How to apply a given technique using R?
Ways to evaluate the performance of a technique (how well it serves your purpose).
All statistical techniques that exist.
All the mathematics behind a statistical technique.
There is no one fixed textbook of this course.
But here are some resources I will be using to teach:
Data Wrangling.
Data Visualization.
Exploratory Data Analysis.
Fundamental Stats - random variables, summary statistics, probability distributions, etc.
Whether for the sake of curiosity or in order to affect outcomes, we are interested in knowing how something works, why something happens, and what will happen.
“Statistical learning refers to a vast set of tools for understanding data.”
“Broadly speaking, supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs.”
“With unsupervised statistical learning, there are inputs but no supervising output; nevertheless we can learn relationships and structure from such data.”
“While this list is not exhaustive, most models fall into at least one of these categories:”
In order to predict, estimate, or classify a variable of interest using other variables, we attempt to find out how these other variables provide systematic information about the variable of interest.
\[Y = f(X) + e\]
Y is the variable of interest.
f(X) is the function (our Nemo, the thing we are trying to find) that provides systematic information about Y.
e is the error term, independent of X.
The essence of statistical learning is to estimate \(f(X)\).
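A minimal simulation can make the setup concrete. Here the true f, the sample size, and all coefficients are our own choices for illustration: we generate data from a known Y = f(X) + e and check that least squares recovers f reasonably well.

```r
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
f <- function(x) 2 + 0.5 * x        # the true f (normally unknown to us)
e <- rnorm(n, mean = 0, sd = 1)     # error term, independent of x
y <- f(x) + e                       # observed response
# Statistical learning: estimate f from (x, y) alone
fit <- lm(y ~ x)
coef(fit)                           # should be close to (2, 0.5)
```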
Either we need to predict or estimate some quantity.
True representation:
\[Y = f(X) + e\]
Model (we are content to treat \(\hat f\) as a black box, since the true form of f is typically unknown):
\[\hat Y = \hat f(X)\]
\[E(Y - \hat Y)^2 = E[f(X) + e - \hat f(X)]^2 = [f(X) - \hat f(X)]^2 + Var(e)\]
The first term is the reducible error; \(Var(e)\) is the irreducible error.
The goal is to minimize the reducible error.
\(Var(e)\) sets an upper bound on the accuracy of our predictions.
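A quick simulation, with a true f and noise level chosen by us for illustration, shows the irreducible floor: even predicting with the true f leaves an expected squared error of roughly Var(e), while a worse model adds reducible error on top of that same floor.

```r
set.seed(42)
n <- 10000
x <- runif(n)
f <- function(x) sin(2 * pi * x)    # true f, chosen for this sketch
sigma <- 0.5                        # so Var(e) = 0.25
y <- f(x) + rnorm(n, sd = sigma)
# Predicting with the true f: error ~= Var(e) = 0.25 (irreducible)
mean((y - f(x))^2)
# A cruder model (predict the mean) adds reducible error on top
mean((y - mean(y))^2)
```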
Alternatively, we may want to understand how Y and X are related (inference).
Here, instead of asking what Y will be, we ask how Y and X are related. We can no longer ignore our lack of knowledge about the form of f(X).
We are interested in the relationship between the response variable and each independent variable.
How to estimate f? Parametric
\[f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n\]
or
\[f(X) = \beta_0 + \beta_1 X_1^2 + \beta_2 X_2^2 + ... + \beta_n X_n^2 \]
Use training data to fit the model, e.g., the least squares method for the first equation.
The goal is now to find n + 1 coefficients (parameters) instead of estimating an arbitrary n-dimensional function.
Disadvantage: the form we choose may not match the true form of f. If the two differ greatly, the quality of the estimates will be poor.
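A sketch of the parametric approach in R, using the built-in mtcars data purely for illustration: we assume the linear form f(X) = b0 + b1 X1 + b2 X2 and let `lm()` estimate the three coefficients by least squares.

```r
# Parametric fit: the whole estimated f is summarised by n + 1 = 3 numbers
fit <- lm(mpg ~ wt + hp, data = mtcars)   # built-in dataset, illustrative
coef(fit)                                 # intercept and two slopes
summary(fit)$r.squared                    # how much variation is explained
```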
How to estimate f? Non-Parametric
Non-parametric methods make no explicit assumption about the functional form of f. They can fit a much wider range of shapes, but require far more observations to obtain an accurate estimate.
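As a contrast to the parametric fit, a sketch of a non-parametric estimate using base R's `loess()` (local regression) on the same illustrative mtcars data: no global equation for f is assumed.

```r
# Non-parametric fit: loess fits many local regressions instead of
# assuming one global form for f (mtcars used only for illustration)
fit_np <- loess(mpg ~ wt, data = mtcars, span = 0.75)
predict(fit_np, newdata = data.frame(wt = 3))  # estimated f at wt = 3
```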
Assessing accuracy: for quantitative response variables
\[ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat y_i)^2 \]
\[\hat y_i = \hat f(x_i)\]
Should you care about training MSE or test MSE?
Should you use training MSE if you don’t have test data?
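The danger of relying on training MSE can be sketched with simulated data (the true f, sample size, and polynomial degrees below are our own choices): an overly flexible model always achieves a lower training MSE, but that says nothing about how it fares on held-out test data.

```r
set.seed(7)
n <- 200
x <- runif(n, -2, 2)
y <- x^2 + rnorm(n, sd = 0.5)                  # true f is quadratic
train <- sample(n, 100)                        # train/test split
mse <- function(fit, idx) {
  mean((y[idx] - predict(fit, data.frame(x = x[idx])))^2)
}
fit2  <- lm(y ~ poly(x, 2),  subset = train)   # matches the true form
fit15 <- lm(y ~ poly(x, 15), subset = train)   # far too flexible
c(train2  = mse(fit2, train),  test2  = mse(fit2, -train))
c(train15 = mse(fit15, train), test15 = mse(fit15, -train))
# The degree-15 fit always wins on training MSE, but typically loses
# on test MSE -- training MSE is a misleading guide to accuracy.
```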
For quantitative response variables
\[E(y_0 - \hat f(x_0))^2 = Var(\hat f(x_0)) + [Bias(\hat f(x_0))]^2 + Var(e)\]
For qualitative response variables
\[\frac{1}{n}\sum_{i=1}^{n} I(y_i \not = \hat y_i)\]
where \(I(\cdot)\) is an indicator that equals 1 when the prediction is wrong and 0 otherwise.
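In R the error rate is a one-liner, since comparing two vectors gives TRUEs and FALSEs that `mean()` averages. Toy vectors below are invented for illustration:

```r
# Misclassification error rate: fraction of observations where the
# predicted class differs from the true class
y_true <- c("yes", "no", "no",  "yes", "yes", "no")
y_hat  <- c("yes", "no", "yes", "yes", "no",  "no")
mean(y_true != y_hat)   # 2 of 6 wrong -> 1/3
```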
Read Chapters 1 and 2 of An Introduction to Statistical Learning with Applications in R.